\[~\]
\[~\]
We are provided with preprocessed data from the Autism Brain Image Data Exchange (ABIDE) project. The dataset is partitioned into two subsets: the Autism Spectrum Disorder (ASD) and the Typically Developed (TD) sets.
\[~\]
\[~\]
As shown in the image above, each subset contains the records of 12 patients, and a sample of 145 records is associated with each person. Formally, this sample can be treated as an IID sample defined as:
\[ D^{(i)}_n = \{X_1, ..., X_n\} \sim f_X(\cdot) \hspace{1em} \text{ for a certain } f_X \text{, having } n=145 \]
Each \(X_i\) is a random vector \(X_i = [X_i(1), ..., X_i(\mathtt{D})]^T \in \mathbb{R}^\mathtt{D}\), where \(\mathtt{D}=116\) is the number of features, i.e., the number of Regions of Interest (ROIs) of the person’s brain. We also assume that each patient within a group can itself be treated as an independent sample from a certain \(f_D(\cdot)\). Thus, moving bottom-up in the data tree, each group formally corresponds to:
\[ D_{12} = \{D^{(1)}_n, D^{(2)}_n, ..., D^{(12)}_n\} \]
It is worth noting that the independence assumptions at the two levels of the data tree are quite crucial. Thanks to them, we have several ways of pooling the data: (i) concatenating the records of all the patients within a group; (ii) computing one correlation matrix per patient and then averaging; (iii) averaging the patients’ data before computing the correlations.
From this point on, following [1], we adopt the second approach: we compute the correlation matrix \(\big(\hat{\rho}_{j,k}\big)^{(i)} \in [-1,1]^{\mathtt{D}\times \mathtt{D}}\) associated with each \(D^{(i)}_n\), \(i=1,...,12\), and take their mean. By the IID property of \(\{D^{(1)}_n, ..., D^{(12)}_n\}\), the set \(\Big\{\big(\hat{\rho}_{j,k}\big)^{(i)}\Big\}_{i=1}^{12}\) is still independent, being a function of each \(D^{(i)}_n\). This choice has several advantages, which will be pointed out further in the discussion.
As pointed out in [2], the Pearson correlation coefficient is non-additive, so it is not mathematically meaningful to average the coefficients as they are. Hence, before averaging, we apply the Fisher Z-transform, which smoothly maps the interval \([-1,1]\) into \(\mathbb{R}\):
\[ \Big\{\hat{Z}^{(i)}_{j,k}\Big\}_{i=1}^{12} = \Big\{\operatorname{arctanh}\left(\hat{\rho}^{(i)}_{j,k}\right)\Big\}_{i=1}^{12} \hspace{1em} \text{where} \hspace{.5em} j,k \in \{1,...,\mathtt{D}\},\hspace{.4em} j\neq k \]
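In R the Fisher Z-transform is the built-in `atanh()`; a minimal numerical check of its closed form:

```r
# Fisher Z-transform: atanh(rho) = 0.5 * log((1 + rho) / (1 - rho)), mapping (-1, 1) into R
rho <- c(-0.9, -0.5, 0, 0.5, 0.9)
z1 <- atanh(rho)
z2 <- 0.5 * log((1 + rho) / (1 - rho))
all.equal(z1, z2)  # TRUE
```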
Recall that, for each \(\hat{Z}^{(i)}_{j,k}\), it holds that:
\[ \hat{Z}^{(i)}_{j,k} \hspace{.4em} \dot{\sim} \hspace{.4em} N_1\Big(\operatorname{arctanh}(\rho_{j,k}), \frac{1}{n - 3}\Big), \hspace{1em} n=145 \]
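As a sanity check (an illustrative simulation, not part of the original pipeline), we can verify the asymptotic variance \(1/(n-3)\) empirically:

```r
# Illustrative simulation: the variance of atanh(rho_hat) over repeated
# bivariate-normal samples of size n should be close to 1/(n - 3)
set.seed(42)
n <- 145; rho <- 0.5; reps <- 2000
z.hat <- replicate(reps, {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)  # cor(x, y) = rho by construction
  atanh(cor(x, y))
})
var(z.hat)   # empirical variance
1 / (n - 3)  # asymptotic variance
```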
Finally, we take:
\[ \overline{Z}_{12}(j,k) = \frac{1}{12}\sum_{i=1}^{12}\hat{Z}^{(i)}_{j,k} \text{ ,} \]
remembering that the \(\hat{Z}^{(i)}_{j,k}\) are IID, being functions of the \(\hat{\rho}^{(i)}_{j,k}\).
Here the convenience of this approach becomes evident: being a sample mean, the estimator is unbiased under the assumption of an (unknown) identical distribution \(f_D\), and, by the CLT, it is asymptotically normal, so that:
\[
\frac{\overline{Z}_{12}(j,k) - Z_{j,k}}{\hat{\sigma}_{j,k}/\sqrt{12}} \hspace{.4em} \dot{\sim} \hspace{.4em} N(0,1),
\] where \(\hat{\sigma}_{j,k} = \frac{1}{\sqrt{145 - 3}}\). These results will prove quite relevant when building the asymptotic confidence interval for \(Z_{j,k}\).
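Under these assumptions the standard error of \(\overline{Z}_{12}(j,k)\) is a constant; a quick computation using the sample sizes stated above:

```r
# Standard error of the pooled estimator Z12-bar (values taken from the text)
n.samp <- 145                  # records per patient
n.p <- 12                      # patients per group
sigma <- 1 / sqrt(n.samp - 3)  # sd of each Z-transformed coefficient
se <- sigma / sqrt(n.p)        # sd of the mean over the 12 patients
round(se, 4)                   # ~ 0.0242
```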
Before moving forward, for the sake of completeness, we discuss the other possible pooling approaches.
Concatenating the data of all the patients within a group leads to:
\[ \hat{Z}_{j,k} \hspace{.4em} \dot{\sim} \hspace{.4em} N_1\Big(\operatorname{arctanh}(\rho_{j,k}), \frac{1}{n - 3}\Big), \hspace{1em} n=145 \cdot 12, \]
which is statistically more efficient than the chosen estimator. Nevertheless, problems arise when building the asymptotic confidence interval: if the distributional assumptions do not strongly hold, this approach is less robust than the chosen one.
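To quantify the efficiency claim, we can compare the asymptotic variances of the two estimators (a sketch using the sample sizes stated above):

```r
# Asymptotic variance of the two pooling strategies
n <- 145; n.p <- 12
v.concat <- 1 / (n * n.p - 3)    # concatenating all 145 * 12 records first
v.mean   <- (1 / (n - 3)) / n.p  # averaging the 12 per-patient Z values
c(concatenated = v.concat, averaged = v.mean)  # concatenation has slightly smaller variance
```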
Finally, averaging the patients’ data before computing the correlation coefficients would yield a more complicated and potentially less accurate solution.
\[~\]
\[~\]
After loading the data, we define a function called get.cor.ROIs() that returns a data frame with the correlations between all possible pairs of ROIs. Inside this function we transform the correlation values with the Fisher Z-transform defined in the previous section.
\[~\]
# return the Z-transformed correlation matrix between ROIs for a single person;
# accepts either a data frame directly (frame = ...) or a group and a person name
get.cor.ROIs <- function(...){
  args <- list(...)
  if(!is.null(args[["person"]])) {
    args$frame <- args[["group"]][[args[["person"]]]]
  }
  n <- ncol(args[["frame"]])
  nms <- names(args[["frame"]])
  # build all unordered ROI pairs efficiently; this is useful for large datasets!
  V1 <- rep(nms[1:(n-1)], seq(from = n-1, to = 1, by = -1))
  V2 <- unlist(lapply(1:(n-1), function(i) nms[(i+1):n]))
  corr.ROIs <- data.frame(ROI_1 = V1, ROI_2 = V2) # frame to be filled with the correlation values
  corr.ROIs$z.pearson.corr <- apply(corr.ROIs, 1, function(row) {
    # Pearson correlation of the two ROI columns, mapped with the Fisher Z-transform
    atanh(cor(args[["frame"]][row["ROI_1"]], args[["frame"]][row["ROI_2"]]))
  })
  return(corr.ROIs)
}
\[~\]
We now call the function defined above for each of the 12 people in each group, passing the group and the corresponding i-th person.
\[~\]
# create the correlation data frames for all patients
for(person in names(asd_sel))
  assign(paste("z.trans.", person, sep=""), get.cor.ROIs(group = asd_sel, person = person))
for(person in names(td_sel))
  assign(paste("z.trans.", person, sep=""), get.cor.ROIs(group = td_sel, person = person))
\[~\]
After computing the correlation data frames for all the patients within each group, we pool the data as outlined at the beginning.
\[~\]
# create a unique (averaged) matrix for each group
unique.matrix <- function(group) {
  mtx <- z.trans.caltech_0051472[c("ROI_1", "ROI_2")] # any patient provides the ROI-pair combinations
  mtx$cor.mean <- apply(mtx, 1, function(row) {
    values <- c()
    for(person in names(group)) {
      frame <- get(paste("z.trans.", person, sep="")) # resolve the variable from its name
      elem <- frame[(frame[["ROI_1"]] == row["ROI_1"]) & (frame[["ROI_2"]] == row["ROI_2"]), "z.pearson.corr"] # select the correlation
      values <- c(values, elem)
    }
    mean(values) # take the mean!
  })
  return(mtx)
}
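Since get.cor.ROIs() builds the ROI pairs in the same deterministic order for every patient, the row-wise lookup above could be replaced by a single rowMeans() call. A sketch (pool.z is a hypothetical helper name, not part of the pipeline; it assumes all frames share identical pair ordering):

```r
# Hypothetical vectorized alternative: average the z.pearson.corr columns directly,
# assuming every per-patient frame has identical (ROI_1, ROI_2) row ordering
pool.z <- function(frames) {
  mtx <- frames[[1]][c("ROI_1", "ROI_2")]
  mtx$cor.mean <- rowMeans(sapply(frames, function(f) f$z.pearson.corr))
  mtx
}
```

Here `frames` would be a list such as `list(z.trans.caltech_0051472, ...)`; the result matches unique.matrix() whenever the ordering assumption holds.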
\[~\]
We finally get the average correlation matrix for the groups ASD and TD:
\[~\]
# call the creation of unique matrix
asd_sel.dt <- unique.matrix(asd_sel); head(asd_sel.dt)
## ROI_1 ROI_2 cor.mean
## 1 2001 2002 0.51188381
## 2 2001 2101 0.02486222
## 3 2001 2102 -0.21203874
## 4 2001 2111 -0.05246567
## 5 2001 2112 -0.19052649
## 6 2001 2201 0.14638295
td_sel.dt <- unique.matrix(td_sel); head(td_sel.dt)
## ROI_1 ROI_2 cor.mean
## 1 2001 2002 0.416005628
## 2 2001 2101 0.086652336
## 3 2001 2102 -0.334109327
## 4 2001 2111 0.007217533
## 5 2001 2112 -0.150770771
## 6 2001 2201 0.164410598
\[~\]
\[~\]
Before computing the estimated association graphs, we show some plots to investigate the properties of the averaged correlation matrices associated with ASD and TD. According to the assignment introduction, two ROIs are strictly defined as co-activated if they are positively correlated: the greater the Pearson coefficient, the stronger the co-activation between them. Hence, the heatmaps below show the 80 most positively correlated ROIs within the two groups:
\[~\]
halfHeatmap(asd_sel.dt, title = "Top 80 positively correlated ROIs in ASD group (Z-scale)")
halfHeatmap(td_sel.dt, title = "Top 80 positively correlated ROIs in TD group (Z-scale)")
\[~\]
Looking at the plots and the images above, it is evident that most co-activated ROIs occupy nearby positions in the brain. As an example, the ROIs named 5001 and 5002 are strongly positively correlated in both groups.
\[~\]
Moving forward, the histograms below show the distribution of the correlation coefficients on the Z-scale for the two groups.
\[~\]
\[~\]
As expected, the values are approximately Gaussian, with a slight left skew.
\[~\]
\[~\]
We now compute the quantile threshold for the two groups. Since we are working on the Z-scale, the threshold is also on the Z-scale.
\[~\]
z.t <- apply(rbind(asd_sel.dt["cor.mean"], td_sel.dt["cor.mean"]), 2, quantile, probs=0.8)
paste("the threshold at the 80th percentile for the subjects is: ", round(as.numeric(z.t), 3), sep = "")
## [1] "the threshold at the 80th percentile for the subjects is: 0,125"
\[~\]
\[~\]
We compute the confidence interval for each \(\overline{Z}_{12}(j,k)\). Starting from:
\[~\]
\[ \frac{\overline{Z}_{12}(j,k) - Z_{j,k}}{\hat{\sigma}_{j,k} / \sqrt{12}} \hspace{.4em} \dot{\sim} \hspace{.4em} N(0,1), \hspace{1em} \text{ where } \hat{\sigma}_{j,k} = \frac{1}{\sqrt{145 - 3}} \]
\[~\]
and applying Bonferroni’s correction, one of the standard methods for the multiple-comparisons problem. Let \(\textit{H}_{1}, ..., \textit{H}_{m}\) be a collection of null hypotheses, one for each correlation value \(\rho_{j,k}\) to be tested. The familywise error rate (FWER) is the probability of rejecting at least one true null hypothesis, i.e., of committing at least one type I error. Bonferroni’s correction controls the FWER at level \(\alpha\) by testing each of the m hypotheses at level:
\[~\]
\[ \frac{\alpha}{m}, \hspace{1em} \text{ where } \hspace{.5em} m = \binom{\mathtt{D}}{2}, \hspace{.5em} \mathtt{D}=116 \]
\[~\]
we end up with:
\[~\]
\[ C_{12}^{Z(j,k)}\Big(\frac{\alpha}{m}\Big) = \bigg[\overline{Z}_{12}(j,k) \mp z_{\alpha/2m} \cdot \frac{\hat{\sigma}_{j,k}}{\sqrt{12}}\bigg] \]
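For reference, the corrected critical value and the resulting half-width of each interval can be computed as follows (a sketch using the figures stated above):

```r
# Bonferroni-corrected critical value for m = choose(116, 2) simultaneous intervals
alpha <- 0.05
m <- choose(116, 2)                          # 6670 ROI pairs
z.crit <- qnorm(1 - alpha / (2 * m))         # ~ 4.48, vs qnorm(0.975) ~ 1.96 uncorrected
half.width <- z.crit / sqrt((145 - 3) * 12)  # z.crit * sigma_hat / sqrt(12)
round(c(m = m, z.crit = z.crit, half.width = half.width), 4)
```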
\[~\]
conf.int <- function(dt, m = 1, alpha = 0.05) { # m : Bonferroni correction factor (FWER control)
  # Asymptotic variance
  n.samp <- 145                 # number of IID samples on which we computed the correlation matrix
  sigma <- 1 / sqrt(n.samp - 3) # standard deviation of the (j,k)-th Z-transformed Pearson coefficient
  n.p <- 12                     # number of people in each of the two groups
  se <- sigma / sqrt(n.p)       # standard error of the estimator Z12
  # Confidence interval
  cint <- sapply(dt$cor.mean, function(z.i) {
    list(confidence_interval = c(lb = z.i - se * qnorm(1 - alpha / (2*m)),
                                 ub = z.i + se * qnorm(1 - alpha / (2*m))))
  })
  return(cint)
}
\[~\]
Finally, the confidence intervals for the first three ROI couples are:
\[~\]
## Compute the confidence interval
m <- (116 * 115) / 2 # number of intervals
asd_sel.dt$cint <- conf.int(asd_sel.dt, m = m); asd_sel.dt[1:3,]
## ROI_1 ROI_2 cor.mean cint
## 1 2001 2002 0.51188381 0.4033775, 0.6203901
## 2 2001 2101 0.02486222 -0.08364405, 0.13336849
## 3 2001 2102 -0.21203874 -0.3205450, -0.1035325
td_sel.dt$cint <- conf.int(td_sel.dt, m = m); td_sel.dt[1:3,]
## ROI_1 ROI_2 cor.mean cint
## 1 2001 2002 0.41600563 0.3074994, 0.5245119
## 2 2001 2101 0.08665234 -0.02185393, 0.19515860
## 3 2001 2102 -0.33410933 -0.4426156, -0.2256031
\[~\]
Since the threshold is \(z_t=\) 0.125 and two ROIs \(j,k\) are considered connected if \([-z_t, z_t] \cap C_{12}^{Z(j,k)}\Big(\frac{\alpha}{m}\Big)=\emptyset\), we expect a connection in the estimated graph between 2001 and 2002, but no connection between 2001 and 2101.
\[~\]
Just as an example, we visualize the confidence interval for \(Z(j=2001,k=2002)\) in the ASD and TD groups:
\[~\]
At this point, we compute the estimated association graphs, applying the condition just mentioned. To be more precise, the (conservative) null hypothesis \(H_0^{j,k}\) states that the (Z-transformed) correlation coefficient \(z_{j,k}\) does not exceed the threshold in absolute value, and thus ROIs \(j,k\) are not correlated. The null hypothesis is rejected if and only if, at level \(\alpha\), the confidence interval does not intersect the threshold interval \([-z_t, z_t]\). In this case, ROIs \(j,k\) are connected with an edge.
\[~\]
## Estimate the adjacency matrix given the Z-transformed Pearson correlation coefficients
get.est.adj.mat <- function(dt, dec.rule, t) {
  # set up the storage matrix: get the names for rows and columns
  nameVals <- sort(unique(unlist(dt[1:2])))
  # construct a zero matrix of the correct dimensions, with row and column names
  adj_mat <- matrix(0, length(nameVals), length(nameVals), dimnames = list(nameVals, nameVals))
  # put an edge according to the decision rule
  for(idx in rownames(dt)){
    if( dec.rule(dt, idx, t) ) {
      adj_mat[as.character(dt[idx, "ROI_1"]), as.character(dt[idx, "ROI_2"])] <- 1
    }
  }
  return(adj_mat)
}
## check whether the two intervals int1 and int2 overlap
are.joint <- function(int1, int2) return((int1[1] <= int2[2]) && (int2[1] <= int1[2]))
## check the simple threshold condition
check.threshold <- function(dt, idx, t) {
  val <- abs(dt[idx, "cor.mean"])
  return(val >= t)
}
## check the confidence interval condition
check.conf.int <- function(dt, idx, t) {
  c.int <- dt[idx, "cint"]$confidence_interval
  t.int <- c(-t, t)
  return(!are.joint(c.int, t.int))
}
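As a quick sanity check of the decision rule (are.joint is re-defined so the snippet is self-contained; the two intervals are rounded values from the ASD output above):

```r
# interval-overlap test used by check.conf.int
are.joint <- function(int1, int2) (int1[1] <= int2[2]) && (int2[1] <= int1[2])

t.int <- c(-0.125, 0.125)            # threshold interval [-z_t, z_t]
are.joint(c(0.403, 0.620), t.int)    # FALSE: disjoint, so ROIs 2001-2002 get an edge
are.joint(c(-0.084, 0.133), t.int)   # TRUE: overlapping, so ROIs 2001-2101 do not
```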
\[~\]
# Adjacency matrix of the estimated graph with the Bonferroni's correction
adj_mat_asd_bonf <- get.est.adj.mat(asd_sel.dt, check.conf.int, z.t)
adj_mat_td_bonf <- get.est.adj.mat(td_sel.dt, check.conf.int, z.t)
# Estimated graph
G.asd.bonf <- graph_from_adjacency_matrix(adj_mat_asd_bonf, mode = "undirected")
G.td.bonf <- graph_from_adjacency_matrix(adj_mat_td_bonf, mode = "undirected")
\[~\]
\[~\]
\[~\]
\[~\]
The association graph taking into account only the threshold is:
\[~\]
# Adjacency matrix of the estimated graph with the raw threshold (no correction)
adj_mat_asd_thr <- get.est.adj.mat(asd_sel.dt, check.threshold, z.t)
adj_mat_td_thr <- get.est.adj.mat(td_sel.dt, check.threshold, z.t)
# Estimated graph
G.asd.thr <- graph_from_adjacency_matrix(adj_mat_asd_thr, mode = "undirected")
G.td.thr <- graph_from_adjacency_matrix(adj_mat_td_thr, mode = "undirected")
\[~\]
We plot the graphs obtained with the plain threshold, without any adjustment:
\[~\]